To conduct our analysis, we ran a series of tests comparing the cropped version of each scan against the full version. We constructed scatter plots, box plots, and ROC curves for visual analysis. We fit a generalized linear model (glm) to obtain the p-value, a measure of significance, of each variable in the model, and constructed kernel density graphs to compare the scans. For every feature, the cropped and full-scan versions had a correlation above 0.90 with each other, so we use only one of the two versions of each feature to avoid collinearity issues.
Extract NA is a measure of the total missing value count in the scan over the total value count. This feature performed better when cropped, with no reservations. The ROC curve and the p-value in the glm were both better for the cropped image. The kernel density graph was also more distinct for the cropped image.
Assess Bottomempty is a measure of the missing value count in the bottom 20% of the scan compared to the total value count in the same area. This feature performed better when cropped, with no reservations. The ROC curve and the p-value in the glm were both better for the cropped image. The kernel density graph was also more distinct for the cropped image.
Assess Col NA is the proportion of columns in the scan's matrix which have more than 20% missing values. This feature performed only marginally better when cropped. The ROC area difference is only 0.008, and the p-values for both versions were found to be extremely significant. The kernel density graphs are very similar, with the cropped graph being slightly more distinct.
Assess Median NA Proportion calculates the proportion of NAs in each column and then takes the median of those values. This feature performed better when left as the full scan. The ROC area was higher for the full scan (0.907 versus 0.863). The p-values for both versions were found to be extremely significant, but the one for the full scan was lower. The kernel density graph was more distinct for the full images.
| Feature | Correlation | pvalue_Full | pvalue_cropped | auc_Full | auc_Cropped |
|---|---|---|---|---|---|
| Extract NA | 0.908 | 0.229 | <2e-16 | 0.871 | 0.902 |
| Assess Bottomempty | 0.905 | 5.86e-10 | <2e-16 | 0.783 | 0.859 |
| Assess Col NA | 0.919 | 1.62e-06 | 1.17e-14 | 0.888 | 0.896 |
| Assess Median NA Proportion | 0.908 | <2e-16 | 6.77e-10 | 0.907 | 0.863 |
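The three numeric columns of the table (correlation, glm p-values, AUC) can be illustrated with a minimal base R sketch. The data below are simulated and the names `sim`, `feature_full`, `feature_cropped` and `auc` are hypothetical. The rank-based AUC is one common way to obtain the area under the ROC curve and is not necessarily how the values in this report were computed.

```r
# Simulated stand-in data: all values are invented for illustration only.
set.seed(1)
n_scans <- 500
good <- rbinom(n_scans, 1, 0.7)
feature_cropped <- rnorm(n_scans, mean = ifelse(good == 1, 8, 14), sd = 3)
feature_full <- feature_cropped + rnorm(n_scans, mean = 7, sd = 1.5)
sim <- data.frame(GoodScan = good, feature_full, feature_cropped)

# Pearson correlation between the two versions of the feature
r <- cor(sim$feature_full, sim$feature_cropped)

# Logistic regression with both versions entered together
fit <- glm(GoodScan ~ feature_full + feature_cropped,
           family = binomial(), data = sim)
p_values <- coef(summary(fit))[, "Pr(>|z|)"]

# AUC via the rank (Mann-Whitney) identity
auc <- function(score, label) {
  ranks <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(ranks[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Scores are negated because lower feature values indicate a good scan here
auc_full <- auc(-sim$feature_full, sim$GoodScan)
auc_cropped <- auc(-sim$feature_cropped, sim$GoodScan)
```

Entering both versions in one model, as in the glm calls later in this report, shows which version carries the signal once the other is accounted for; the high correlation between them is the reason only one is kept.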
All of the features had notable outliers in their individual predictive power. We investigated the type I errors (false positives, where a bad scan is predicted as good) and found them all to be “Tiny Problems” scans which could reasonably be re-classified as “problematic” or worse.
This analysis compares the cropped and non-cropped (full) versions of a scan for quality identification. Cropping has the potential to decrease noise around the signal. The level of cropping we consider is 5% from each of the left and right sides, and 10% from the top of the image. In particular, we want to preserve the bottom and the center of the image, as that is where most of the signal is. Below are examples of a full image, a full image with marked edges, and a cropped image.
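The cropping rule described above can be sketched in base R. The helper `crop_scan` is hypothetical and is not the function used to produce this report. It assumes the surface matrix is oriented with rows running from top to bottom and columns from left to right, which may not match how a given x3p object stores its matrix.

```r
# Hypothetical helper: crop a scan matrix by the fractions described above.
# Assumes rows run top to bottom and columns run left to right.
crop_scan <- function(x, side = 0.05, top = 0.10) {
  stopifnot(is.matrix(x), side >= 0, side < 0.5, top >= 0, top < 1)
  m <- nrow(x)
  n <- ncol(x)
  drop_top <- floor(top * m)    # rows removed from the top
  drop_side <- floor(side * n)  # columns removed from each side
  rows <- seq(drop_top + 1, m)  # the bottom rows are always kept
  cols <- seq(drop_side + 1, n - drop_side)
  x[rows, cols, drop = FALSE]
}

scan <- matrix(seq_len(100 * 200), nrow = 100, ncol = 200)
cropped <- crop_scan(scan)
dim(cropped)
```

Because only the top and the two sides are trimmed, the bottom and the center of the scan, where most of the signal is, are left untouched.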
We have identified four potential features of our scan quality assessor which could benefit from being cropped.
The function extract_na calculates the percentage of missing values in the scan (part) under observation. For a scan surface matrix \(X \in {\rm I\!R}^{m, n}\), this percentage is defined as follows.
Let \(A=\{NA\}\) be the set of undefined values. For simplicity of notation we will assume that the space of real values \({\rm I\!R}\) contains \(A\): \({\rm I\!R}:= {\rm I\!R} \cup A\).
With that, let \(X \in {\rm I\!R}^{m,n}\) be a real-valued surface matrix of dimensions \(m \times n\), where \(m\) and \(n\) are strictly positive integers, \(X = (x_{ij})_{1 \leq i \leq m, 1 \leq j \leq n}\). The proportion of missing values in \(X\) is then defined as: \[ \frac{1}{m \cdot n} \sum^m_{i=1} \sum^n_{j=1} \theta_A(x_{ij}) \\ \text{where } \theta_A(x) = \left\{\begin{aligned} &1 &&: \text{if }x \in A\\ &0 &&: \text{otherwise}\\ \end{aligned} \right. \] Multiplying by 100 gives the percentage.

Assess Bottomempty
The feature assess_bottomempty calculates the percentage
of missing values in the bottom 20% of the scan.
Let \(A=\{NA\}\) be the set of undefined values. For simplicity of notation we will assume that the space of real values \({\rm I\!R}\) contains \(A\): \({\rm I\!R}:= {\rm I\!R} \cup A\).
With that, let \(X \in {\rm I\!R}^{m,n}\) be a real-valued surface matrix of dimensions \(m \times n\), where \(m\) and \(n\) are strictly positive integers, \(X = (x_{ij})_{1 \leq i \leq m, 1 \leq j \leq n}\).
Let \(R = (R_1, \dots, R_m)\) be the vector of row-wise NA counts, where each element is the number of NAs in the given row: \[ R_i = \sum^n_{j=1} \theta_A(x_{ij}), \quad 1 \leq i \leq m \\ \text{where } \theta_A(x) = \left\{\begin{aligned} &1 &&: \text{if }x \in A\\ &0 &&: \text{otherwise}\\ \end{aligned} \right. \]
Let \(B \subset R\) be the set of all \(R_i\) with \(i > 0.8 \cdot m\), i.e. the NA counts of the rows in the bottom 20% of the scan. The percentage of missing values in the bottom 20% of \(X\) is then given by: \[ \frac{100}{0.2 \cdot m \cdot n}\sum_{i > 0.8 \cdot m} R_i \]

Assess Col NA
The function assess_col_na calculates the threshold-adjusted proportion of columns with a high percentage of missing values.
For every column in the matrix of a scan, we find the proportion of values in that column which are NA. Then we count the columns whose proportion is greater than 20%, the pre-determined threshold of acceptable NAs. Finally, we divide by the number of columns * 0.2 to get our final threshold-adjusted number.
Let \(A=\{NA\}\) be the set of undefined values. For simplicity of notation we will assume that the space of real values \({\rm I\!R}\) contains \(A\): \({\rm I\!R}:= {\rm I\!R} \cup A\).
With that, let \(X \in {\rm I\!R}^{m,n}\) be a real-valued surface matrix of dimensions \(m \times n\), where \(m\) and \(n\) are strictly positive integers, \(X = (x_{ij})_{1 \leq i \leq m, 1 \leq j \leq n}\).
Let \(R = (R_1, \dots, R_n)\) be the vector of column-wise NA counts, where each element is the number of NAs in the given column: \[ R_j = \sum^m_{i=1} \theta_A(x_{ij}), \quad 1 \leq j \leq n \\ \text{where } \theta_A(x) = \left\{\begin{aligned} &1 &&: \text{if }x \in A\\ &0 &&: \text{otherwise}\\ \end{aligned} \right. \]
We define \(P_j\) as the percentage of NAs in column \(j\), which has \(m\) entries: \[ P_j = \frac{R_j}{m} \cdot 100, \quad 1 \leq j \leq n \]
We now count the columns whose percentage of NAs exceeds the threshold and adjust by the threshold: \[ \frac{\sum_{j=1}^n \beta(P_j)}{0.2 \cdot n} \\ \text{where } \beta(x) = \left\{\begin{aligned} &1 &&: \text{if }x > 20\\ &0 &&: \text{otherwise}\\ \end{aligned} \right. \]
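As a minimal sketch of the prose description of assess_col_na (count the columns above the 20% threshold, then divide by the number of columns * 0.2), the following base R function may help. The name `col_na_sketch` is hypothetical and this is not the package implementation. It compares proportions rather than percentages, which is equivalent.

```r
# Sketch of the threshold-adjusted column measure; not the package source.
col_na_sketch <- function(x, threshold = 0.2) {
  stopifnot(is.matrix(x))
  prop_na <- colMeans(is.na(x))     # proportion of NAs per column
  n_over <- sum(prop_na > threshold) # columns strictly above the threshold
  n_over / (ncol(x) * threshold)     # threshold-adjusted value
}

x <- matrix(1, nrow = 10, ncol = 5)
x[1:3, 1] <- NA  # 30% missing: counted
x[1:2, 2] <- NA  # 20% missing: not counted, the comparison is strict
col_na_sketch(x)
```

Under this reading the value ranges from 0 (no column above the threshold) to 1 / 0.2 = 5 (every column above it), which agrees with the range of the summary statistics reported for this feature.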
The function assess_median_na_proportion calculates the proportion of NAs in each column, and then finds the median of all those values.
Let \(A=\{NA\}\) be the set of undefined values. For simplicity of notation we will assume that the space of real values \({\rm I\!R}\) contains \(A\): \({\rm I\!R}:= {\rm I\!R} \cup A\).
With that, let \(X \in {\rm I\!R}^{m,n}\) be a real-valued surface matrix of dimensions \(m \times n\), where \(m\) and \(n\) are strictly positive integers, \(X = (x_{ij})_{1 \leq i \leq m, 1 \leq j \leq n}\).
Let \(R = (R_1, \dots, R_n)\) be the vector of column-wise NA proportions, where each element is the mean of the NA indicator over the given column: \[ R_j = \frac{\sum^m_{i=1} \theta_A(x_{ij})}{m}, \quad 1 \leq j \leq n \\ \text{where } \theta_A(x) = \left\{\begin{aligned} &1 &&: \text{if }x \in A\\ &0 &&: \text{otherwise}\\ \end{aligned} \right. \]
We then sort the values and select the median of \(R\).
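The three proportion-based definitions in this section (extract_na, assess_bottomempty and assess_median_na_proportion) can be sketched in base R as follows. The function names ending in `_sketch` are hypothetical and are not the package implementations. The bottom 20% is taken to be the last rows of the matrix, which is an assumption about orientation, and the row count is rounded when 20% of the rows is not a whole number.

```r
# Sketches of the definitions above; not the package implementations.
theta <- function(x) is.na(x)  # indicator of a missing value

# Percentage of missing values in the whole matrix
extract_na_sketch <- function(x) mean(theta(x)) * 100

# Percentage of missing values in the bottom 20% of rows
# (assumes the last rows of the matrix are the bottom of the scan)
bottomempty_sketch <- function(x, share = 0.2) {
  m <- nrow(x)
  k <- max(1, round(share * m))  # number of bottom rows
  bottom <- seq(m - k + 1, m)
  mean(theta(x[bottom, , drop = FALSE])) * 100
}

# Median over columns of the per-column proportion of missing values
median_na_sketch <- function(x) median(colMeans(theta(x)))

x <- matrix(1, nrow = 10, ncol = 4)
x[9:10, 1:2] <- NA  # 4 of the 8 cells in the bottom two rows are missing
x[1, 3] <- NA       # one further missing cell near the top
extract_na_sketch(x)   # 5 of 40 cells are missing
bottomempty_sketch(x)  # 4 of 8 bottom cells are missing
median_na_sketch(x)    # median of 0.2, 0.2, 0.1 and 0
```

In this toy matrix the missing values are concentrated in the bottom rows, so the bottom-only percentage is much larger than the whole-matrix percentage, which is the situation assess_bottomempty is meant to pick up.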
## [1] "Extract NA. Correlation: 0.908 Full AUC: 0.871 Cropped AUC: 0.902"
| Scan | min | firstQ | med | mean | thirdQ | max |
|---|---|---|---|---|---|---|
| Standard | 3.666645 | 12.79727 | 15.082281 | 15.915471 | 18.30296 | 48.45372 |
| Cropped | 1.389464 | 6.38338 | 8.060005 | 8.916349 | 10.57698 | 40.80471 |
##
## Call:
## glm(formula = GoodScan ~ extract_na + extract_na_cropped, family = binomial(),
## data = full_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1882 -0.2394 0.3044 0.5298 4.0795
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.16247 0.37858 18.919 <2e-16 ***
## extract_na -0.04349 0.03613 -1.204 0.229
## extract_na_cropped -0.58303 0.05020 -11.615 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2230.6 on 1849 degrees of freedom
## Residual deviance: 1299.2 on 1847 degrees of freedom
## AIC: 1305.2
##
## Number of Fisher Scoring iterations: 6
The values for feature extract_na are highly correlated
between the cropped and the full scan.
Treating good scans and scans with only tiny problems as overall ‘good’ scans, the feature applied to cropped scans has an increased accuracy compared to the feature values from the full scan.
We might want to follow up on the orange colored scans:
full_data$LAPD_id[full_data$followup]
## [1] "FAU263-BA-L4" "FAU263-BC-L1" "FAU263-BC-L3" "FAU287-BC-L5" "FAU154-BD-L2"
## [6] "FAU277-BA-L4" "FAU286-BA-L5"
followupScans <- rbind(followupScans, full_data[full_data$followup == TRUE,])
# All followups for extract_na are mislabelled scans. They are all labelled as tiny problems but should be problematic or worse.
## [1] "Assess Bottomempty. Correlation: 0.905 Full AUC: 0.783 Cropped AUC: 0.859"
| Scan | min | firstQ | med | mean | thirdQ | max |
|---|---|---|---|---|---|---|
| Standard | 8.516493 | 22.66947 | 27.54019 | 29.90259 | 34.63373 | 95.39375 |
| Cropped | 3.564558 | 10.78581 | 13.54205 | 15.28335 | 17.71191 | 84.22476 |
##
## Call:
## glm(formula = GoodScan ~ assess_bottomempty + assess_bottomempty_cropped,
## family = binomial(), data = full_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6794 -0.2813 0.4084 0.6028 4.0680
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.57647 0.25156 18.193 < 2e-16 ***
## assess_bottomempty 0.09046 0.01460 6.194 5.86e-10 ***
## assess_bottomempty_cropped -0.40779 0.02738 -14.895 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2230.6 on 1849 degrees of freedom
## Residual deviance: 1503.7 on 1847 degrees of freedom
## AIC: 1509.7
##
## Number of Fisher Scoring iterations: 5
The values for feature assess_bottomempty are highly
correlated between the cropped and the full scan.
Treating good scans and scans with only tiny problems as overall ‘good’ scans, the feature applied to cropped scans has an increased accuracy compared to the feature values from the full scan.
We might want to follow up on the orange colored scans:
full_data$LAPD_id[full_data$followup]
## [1] "FAU263-BA-L4" "FAU287-BC-L5" "FAU254-BD-L4" "FAU275-BC-L5" "FAU275-BD-L3"
## [6] "FAU277-BA-L4" "FAU286-BA-L5"
followupScans <- rbind(followupScans, full_data[full_data$followup == TRUE,])
## [1] "Assess Col NA Correlation: 0.919 Full AUC: 0.888 Cropped AUC: 0.896"
| Scan | min | firstQ | med | mean | thirdQ | max |
|---|---|---|---|---|---|---|
| Standard | 0.2573318 | 0.9244176 | 1.0755102 | 1.1654392 | 1.292844 | 4.368436 |
| Cropped | 0.0920783 | 0.4700281 | 0.6126759 | 0.6851693 | 0.807577 | 3.356120 |
##
## Call:
## glm(formula = GoodScan ~ assess_col_na + assess_col_na_cropped,
## family = binomial(), data = full_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2254 -0.1897 0.3176 0.5365 3.6049
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.3581 0.3785 19.440 < 2e-16 ***
## assess_col_na -2.4724 0.5155 -4.796 1.62e-06 ***
## assess_col_na_cropped -4.8233 0.6249 -7.719 1.17e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2230.6 on 1849 degrees of freedom
## Residual deviance: 1308.8 on 1847 degrees of freedom
## AIC: 1314.8
##
## Number of Fisher Scoring iterations: 6
The values for feature assess_col_na are highly
correlated between the cropped and the full scan.
Treating good scans and scans with only tiny problems as overall ‘good’ scans, the feature applied to cropped scans has an increased accuracy compared to the feature values from the full scan.
We might want to follow up on the orange colored scans:
full_data$LAPD_id[full_data$followup]
## [1] "FAU263-BA-L4" "FAU263-BB-L3" "FAU263-BC-L1" "FAU263-BC-L3" "FAU154-BD-L2"
## [6] "FAU286-BA-L5"
followupScans <- rbind(followupScans, full_data[full_data$followup == TRUE,])
## [1] "Assess Median NA Proportion. Correlation: 0.908 Full AUC: 0.907 Cropped AUC: 0.863"
| Scan | min | firstQ | med | mean | thirdQ | max |
|---|---|---|---|---|---|---|
| Standard | 0.0000000 | 0.0033099 | 0.0108771 | 0.0207348 | 0.0271577 | 0.2552491 |
| Cropped | 0.0011692 | 0.0518201 | 0.0686747 | 0.0751096 | 0.0918769 | 0.2994505 |
##
## Call:
## glm(formula = GoodScan ~ assess_median_na_proportion + assess_median_na_proportion_cropped,
## family = binomial(), data = full_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8226 -0.1079 0.3143 0.4865 5.2612
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.384 0.257 17.060 < 2e-16 ***
## assess_median_na_proportion -83.035 5.925 -14.014 < 2e-16 ***
## assess_median_na_proportion_cropped -21.546 3.491 -6.171 6.77e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2230.6 on 1849 degrees of freedom
## Residual deviance: 1219.0 on 1847 degrees of freedom
## AIC: 1225
##
## Number of Fisher Scoring iterations: 6
The values for feature assess_median_na_proportion are highly correlated
between the cropped and the full scan.
Treating good scans and scans with only tiny problems as overall ‘good’ scans, the feature applied to full scans has an increased accuracy compared to the feature values from the cropped scan.
We might want to follow up on the orange colored scans:
full_data$LAPD_id[full_data$followup]
## [1] "FAU263-BC-L3" "FAU154-BD-L2" "FAU204-BC-L4"
followupScans <- rbind(followupScans, full_data[full_data$followup == TRUE,])
followupUnique <- followupScans[duplicated(followupScans) == FALSE,]
followupScans %>% group_by(followupScans$LAPD_id) %>% summarize(
count = n()
)
## # A tibble: 12 x 2
## `followupScans$LAPD_id` count
## <chr> <int>
## 1 FAU154-BD-L2 3
## 2 FAU204-BC-L4 1
## 3 FAU254-BD-L4 1
## 4 FAU263-BA-L4 3
## 5 FAU263-BB-L3 1
## 6 FAU263-BC-L1 2
## 7 FAU263-BC-L3 3
## 8 FAU275-BC-L5 1
## 9 FAU275-BD-L3 1
## 10 FAU277-BA-L4 2
## 11 FAU286-BA-L5 3
## 12 FAU287-BC-L5 2
# 3 hits: FAU154-BD-L2, FAU263-BA-L4, FAU263-BC-L3, FAU286-BA-L5
# 2 hits: FAU263-BC-L1, FAU277-BA-L4, FAU287-BC-L5
# 1 hit: FAU204-BC-L4, FAU254-BD-L4, FAU263-BB-L3, FAU275-BC-L5, FAU275-BD-L3
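The hit counts above can be cross-checked without the full data frame by tabulating the IDs printed in the four follow-up lists. This is a sketch in base R using `table()` rather than the dplyr pipeline from the report. The vector `followup_ids` is a hypothetical name and simply restates the printed IDs.

```r
# IDs copied from the four follow-up lists printed above
followup_ids <- c(
  # extract_na
  "FAU263-BA-L4", "FAU263-BC-L1", "FAU263-BC-L3", "FAU287-BC-L5",
  "FAU154-BD-L2", "FAU277-BA-L4", "FAU286-BA-L5",
  # assess_bottomempty
  "FAU263-BA-L4", "FAU287-BC-L5", "FAU254-BD-L4", "FAU275-BC-L5",
  "FAU275-BD-L3", "FAU277-BA-L4", "FAU286-BA-L5",
  # assess_col_na
  "FAU263-BA-L4", "FAU263-BB-L3", "FAU263-BC-L1", "FAU263-BC-L3",
  "FAU154-BD-L2", "FAU286-BA-L5",
  # assess_median_na_proportion
  "FAU263-BC-L3", "FAU154-BD-L2", "FAU204-BC-L4"
)

# Number of features that flagged each scan, most frequent first
hits <- sort(table(followup_ids), decreasing = TRUE)

# Scan IDs grouped by their hit count
split(names(hits), as.integer(hits))
```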
FAU263_BA_L4 <- x3p_read("../data/followup_scans/LAPD - 263 - Bullet A - Land 4 - Sneox2 - 20x - auto light left image + 20% - threshold 2 - resolution 4 - Connor Hergenreter.x3p")
FAU263_BC_L1 <- x3p_read("../data/followup_scans/LAPD - 263 - Bullet C - Land 1 - Sneox2 - 20x - auto light left image + 20% - threshold 2 - resolution 4 - Connor Hergenreter.x3p")
FAU263_BC_L3 <- x3p_read("../data/followup_scans/LAPD - 263 - Bullet C - Land 3 - Sneox2 - 20x - auto light left image + 20% - threshold 2 - resolution 4 - Connor Hergenreter.x3p")
FAU277_BA_L4 <- x3p_read("../data/followup_scans/LAPD - 277 - Bullet A - Land 4 - Sneox2 - 20x - auto light left image + 20% - threshold 2 - resolution 4 - Connor Hergenreter.x3p")
x3p_image(FAU263_BA_L4, file="./Comparative-Analysis_files/figure-html/FAU263_BA_L4.png")
x3p_image(FAU263_BC_L1, file="./Comparative-Analysis_files/figure-html/FAU263-BC-L1.png")
x3p_image(FAU263_BC_L3, file="./Comparative-Analysis_files/figure-html/FAU263_BC_L3.png")
x3p_image(FAU277_BA_L4, file="./Comparative-Analysis_files/figure-html/FAU277-BA-L4.png")
| LAPD.ID | Hit.Count | Current.Quality | Current.Problem | Recommended.Quality |
|---|---|---|---|---|
| FAU263-BA-L4 | 3 | Tiny Problems | Feathering | NA |
| FAU263-BC-L3 | 3 | Tiny Problems | Feathering | NA |
| FAU154-BD-L2 | 3 | Tiny Problems | Holes | NA |
| FAU286-BA-L5 | 3 | Tiny Problems | Holes | NA |
| FAU263-BC-L1 | 2 | Tiny Problems | Feathering | NA |
| FAU287-BC-L5 | 2 | Tiny Problems | Feathering | NA |
| FAU277-BA-L4 | 2 | Tiny Problems | Holes | NA |
| FAU254-BD-L4 | 1 | Tiny Problems | Holes | NA |
| FAU275-BC-L5 | 1 | Tiny Problems | Holes | NA |
| FAU275-BD-L3 | 1 | Tiny Problems | Holes | NA |
| FAU263-BB-L3 | 1 | Tiny Problems | Feathering | NA |
| FAU204-BC-L4 | 1 | Tiny Problems | Holes | NA |
FAU154_BD_L2 (3 hits, Tiny Problems, Holes):
Significant feathering across image.
FAU204_BC_L4 (1 hit, Tiny Problems, Holes):
Feathering on each end of image, rotational issues on left edge, holes in the center.
FAU254_BD_L4 (1 hit, Tiny Problems, Holes):
Significant missing values on the bottom, holes across the center, missing section on right side.
FAU263_BA_L4 (3 hits, Tiny Problems, Feathering):
Significant feathering on right side of image, missing most of the left, and many missing values on bottom.
FAU263_BC_L1 (2 hits, Tiny Problems, Feathering):
Significant feathering on right-hand side; left side is missing most of the values, then feathering, then holes as it moves towards the middle. Bottom is also speckled with missing values.
FAU263_BC_L3 (3 hits, Tiny Problems, Feathering):
Contains significant feathering, holes, disproportionate edges and missing values at the bottom.
FAU275_BC_L5 (1 hit, Tiny Problems, Holes):
FAU275_BD_L3 (1 hit, Tiny Problems, Holes):
FAU277_BA_L4 (2 hits, Tiny Problems, Holes):
A few holes, significant missing values on the left, right, and bottom.
FAU286_BA_L5 (3 hits, Tiny Problems, Holes):